Finding Similarities in Source Code Through Factorization
Authors
Abstract
The wide availability of documents on the Web makes plagiarism attractive and easy. It affects every kind of document, from natural-language texts to more structured material such as programs. To cope with this problem, many tools and algorithms have been proposed to find similarities. In this paper we present a new algorithm designed to detect similarities in source code. Unlike existing methods, this algorithm relies on the notion of function and focuses on obfuscation through inlining and outlining of functions. The method is also effective against insertions, deletions and permutations of instruction blocks. It is based on code factorization and uses adapted pattern-matching algorithms and structures such as suffix arrays.
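The abstract names suffix arrays as one of the underlying structures without giving details here. As a rough illustration only, and not the authors' factorization algorithm, the sketch below builds a naive suffix array over two token sequences and reports their longest shared run of tokens. The token names, the `SEP` separator and all function names are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical separator token, assumed never to occur in real input.
SEP = "\x00"


def suffix_array(tokens: Sequence[str]) -> List[int]:
    """Naive suffix array: start indices ordered by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))


def common_prefix_len(tokens: Sequence[str], i: int, j: int) -> int:
    """Length of the common prefix of two suffixes, stopping at SEP."""
    n = len(tokens)
    k = 0
    while (i + k < n and j + k < n
           and tokens[i + k] == tokens[j + k]
           and tokens[i + k] != SEP):
        k += 1
    return k


def longest_shared_run(a: Sequence[str], b: Sequence[str]) -> Tuple[int, int, int]:
    """Return (start_in_a, start_in_b, length) of the longest common token run."""
    tokens = list(a) + [SEP] + list(b)
    split = len(a)  # index of SEP; suffixes starting before it belong to a
    sa = suffix_array(tokens)
    best = (0, 0, 0)
    # The longest cross-sequence match always shows up between two
    # neighbouring suffixes that come from different inputs.
    for x, y in zip(sa, sa[1:]):
        if (x < split) == (y < split):
            continue  # both suffixes come from the same input
        if x > y:
            x, y = y, x  # make x the suffix from a, y the one from b
        length = common_prefix_len(tokens, x, y)
        if length > best[2]:
            best = (x, y - split - 1, length)
    return best


# Two hypothetical token streams: the same two statements in swapped order.
first = ["id", "=", "id", "+", "num", ";", "call", "(", ")", ";"]
second = ["call", "(", ")", ";", "id", "=", "id", "+", "num", ";"]
print(longest_shared_run(first, second))
```

Working on tokens rather than raw characters makes the match insensitive to identifier renaming when identifiers are mapped to a generic token such as `id`. A practical tool would replace the quadratic sort with a linear-time suffix array construction.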
Similar resources
Viewing functions as token sequences to highlight similarities in source code
The detection of similarities in source code has applications not only in software re-engineering (to eliminate redundancies) but also in software plagiarism detection. The latter can be a challenging problem since more or less extensive edits may have been performed on the original copy: insertion or removal of useless chunks of code, rewriting of expressions, transposition of code, inlining ...
Supervised Matrix Factorization for Cross-Modality Hashing
Matrix factorization has been recently utilized for the task of multi-modal hashing for cross-modality visual search, where basis functions are learned to map data from different modalities to the same Hamming embedding. In this paper, we propose a novel cross-modality hashing algorithm termed Supervised Matrix Factorization Hashing (SMFH) which tackles the multi-modal hashing problem with a co...
Code Similarity Detection in Multiple Large Source Trees using Token Hashes
The ability to find similarities between two source code bases, or within one code base, has many uses including the detection of student plagiarism, the identification of intellectual property violations and the location of repeated code in a code base amenable to refactoring. Previous structure-metric approaches have used either suffix trees or modified Longest Common Subsequence algorithms t...
Triple factorization of non-abelian groups by two maximal subgroups
Triple factorizations of a group $G$, of the form $G=ABA$ for some proper subgroups $A$ and $B$ of $G$, have been studied recently, and the closely related rank-two geometry and rank-two coset geometry have been defined and calculated for abelian groups. In this paper we study two infinite classes of non-abelian finite groups $D_{2n}$ and $PSL(2,2^{n})$...
A Projected Alternating Least square Approach for Computation of Nonnegative Matrix Factorization
Nonnegative matrix factorization (NMF) is a common method in data mining that has been used in different applications as a dimension reduction, classification or clustering method. Alternating least squares (ALS) methods are usually used to solve this non-convex minimization problem. At each step of an ALS algorithm, two convex least squares problems must be solved, which causes high com...
Journal title:
- Electr. Notes Theor. Comput. Sci.
Volume 238
Publication date 2009